AUTOMATIC CONFIGURATION OF A MICROPROCESSOR
TECHNICAL FIELD 

The present invention is in the field of digital computing systems. In particular, it relates to the 
automatic configuration of a microprocessor architecture. 

BACKGROUND ART 

For a new processor Instruction Set Architecture (ISA) to be successful, high quality
development tools and a wide range of applications supporting that ISA are required. Compilers
must be made available that target the architecture along with the associated libraries and
linker. A debugger is required to allow programs to be debugged while running on the
architecture. Modern debuggers need to support symbolic level operation so that code can be
executed with a view of the original source code. Software engineers expect an integrated
development environment that ties the compiler and debugger tools into a powerful GUI based
environment. If software engineers cannot work in a familiar software environment then this
represents a significant barrier to the adoption of a new architecture. The development of
such an environment and associated tools represents many man years of development work
even if existing compilers and tools can be retargeted to the new architecture.

Software developed in high level languages can be recompiled for execution on a new ISA. 
However, in practice, this can require significant effort. Moreover, certain types of application
software such as Operating Systems have strong architectural dependencies which make
porting to a new ISA much more difficult.

There has been a general trend within the microprocessor industry to develop new
generations of faster microprocessors that are backwardly compatible with existing ISAs. This
significantly eases the adoption of new product generations. However, supporting an existing
ISA in a new architecture creates significant hardware overhead especially if the intention is to
extract significant parallelism from code. This overhead is particularly significant for
microprocessors used within embedded systems where cost is highly significant.

It is advantageous to be able to support an existing ISA on a new microprocessor without
hardware overhead. This can be achieved using instruction set translation. The ISA of a host
microprocessor is converted into the ISA of a particular target microprocessor. There is a
significant body of prior art in the area of instruction set translation. A number of academic
and commercial systems have been built that allow binaries written for one architecture to be
executed on another. One significant challenge is achieving high enough performance on the
target architecture. The precise emulation of the idiosyncrasies of an architecture on another
significantly degrades performance.

The simplest method is interpretive emulation. A soft CPU is built on the target architecture
that is able to read and interpretively execute the instructions from the host architecture.
Unfortunately this method is very slow and inefficient and is largely impractical for use in
embedded systems. Moreover, this method does not allow the translated code to make
effective use of the particular architectural features of the target.

The majority of recent research and commercialization in this field has been in the area of
dynamic translation techniques. This method allows a very exact emulation of an architecture
to be achieved while maintaining high performance levels. As code from the host architecture
is encountered it is converted, at run time, into code for the target architecture using a
dynamic code translator. The translated code can then be stored in a cache. The translated
code can then be executed to produce the required results. If the same block of code needs to
be executed then the translated version from the cache can be used again without the need to
translate it again. In some systems an increasing amount of time is devoted to performing
optimisations on a particular code sequence in the cache if it is frequently executed. Thus the
run time system can target computationally expensive optimisations on frequently executed
code. Dynamic translation systems can provide very exact emulations of architectures, even
for events that are normally very difficult to handle in translation. For instance, self modifying
code can be handled simply by flushing any affected code from the cache. Instructions that
generate exceptions (such as memory accesses that generate a page miss) can also be handled
and produce a machine state identical to that of the host architecture. Breakpoints can be
handled as exceptions so that if a breakpoint is encountered execution can be made to stop at
a particular host instruction. Single stepping is achieved by producing translated code blocks
consisting of a single host instruction. An example of a commercially available dynamic
translation system is that provided by Transmeta Inc. They have designed a soft x86 processor
that actually runs on a VLIW architecture, by utilisation of dynamic translation techniques.
More recently Transitive Technologies have announced a more general technology that allows
dynamic translation between a number of different embedded processor architectures.

Dynamic translation is less suitable for embedded computing environments. Firstly, there is a
significant memory overhead created by the translator itself and the size of the cache required
in order to achieve good performance. Secondly, dynamic translation systems do not provide
sufficiently deterministic behaviour. Determinism is especially important for embedded real
time environments. There is a significant start up delay while code from the application is
translated into the cache. There may also be significant delays if an important block of code
becomes evicted from the cache.

There is also benefit to the end user being able to extend the ISA of a particular processor.
This enables fast custom hardware for a particular application domain to be directly accessible
from software. Some existing configurable RISC processors (such as those supplied by
Tensilica Inc and Arc Cores) have a facility to extend the instruction set. A number of unused
operation codes are made available and are used to select an added instruction. The instruction
execution logic has to be integrated into the pipeline of the processor in order to receive
operands and write results back into the register file. This integration is more automatic in the
case of the Tensilica solution. Both the Tensilica and Arc processors have their own
instruction set and tool chain. The tools can be updated so that the new instruction can be
accessed through the compiler and assembler using a user specified mnemonic.

SUMMARY OF INVENTION

This document discloses a process for automatically configuring a microprocessor using an
existing ISA. These microprocessors are targeted at embedded systems applications that
execute repetitive code that contains high degrees of potential parallelism. The
microprocessors are configured and programmed automatically by the analysis of the
application software in the form of an executable image in the ISA of a particular host
microprocessor. The configured microprocessors have a customised target ISA that is
specifically designed to exploit parallelism in the application software.

The instruction set translation operates by converting each source machine instruction into a
sequence of more basic operations to be executed on a target processor. All register reads
and writes are formed into separate operations. Thus a 3-address add operation is converted
into operations to read the left and right operand registers, the add operation itself and finally
an operation to write back the result to the register file. Instructions with complicated
addressing modes or that modify the condition codes result in longer sequences of basic
operations.

One disadvantage of prior art static translation systems is that they are unable to support
debugging using the host instruction set. It is obviously imperative to support existing
debuggers. The innovations in the translation approach are primarily in the methods to allow
such debug support. By maintaining a correspondence between the host processor state and
the coprocessor state at specified points in the execution it is possible to support host level
breakpoints on the architecture. In other words a breakpoint can be set specified by a host
instruction address and the architectural state reproduced as though the code was actually
running on the host processor.

Instruction set extension is supported by converting calls to particular software functions into
an invocation of an extension hardware unit. A hardware unit is designed that performs the
same operation as a particular software function. That is, it takes the same parameters and
produces the same results as the code in its software equivalent. The advantage of the
hardware version is that it will be able to produce the results more quickly.

The configured target microprocessors may be used as coprocessors within a system. They are
responsible for executing certain software functions translated from an executable image of
another host microprocessor. This host microprocessor will typically also be present in the
system. Mechanisms are provided to allow the host microprocessor and target coprocessors to
interact and maintain coherence between the memory systems.

BRIEF DESCRIPTION OF DRAWINGS

Figure 1 provides an illustration of a functional unit with a number of cycles latency between
the consumption of input operands and the generation of an output result.

Figure 2 provides some C code which contains an example of the use of intrinsic functions.

Figure 3 provides an overview of the hardware used to transform a
host address.

Figure 4 provides an example of the mapping of an indirect function call on the host
architecture to an indirect function call on the target.

Figure 5 provides an example of a direct function call on the host and how it is translated into
a call on the target.

Figure 6 provides an overview of the design flow used with the tool.

Figure 7 illustrates how collision pointers are used to handle multiple links that map to the
same address in the target code area.

Figure 8 provides an overview of the target data format.

Figure 9 shows the interactions between the host processor, coprocessor and memory in the
context of the host processor running autonomously while the coprocessor is active.

Figure 10 provides an overview of the connectivity of a host processor and coprocessors in a
system and how they are connected to a debug environment on a remote system.

Figure 11 shows the interactions between the host processor, coprocessor and memory in the
context of the host processor being blocked while the coprocessor is active.

Figure 12 provides an overview of the target address format.



WO 2004/003730 PCT/GB2003/002778

DESCRIPTION OF PRESENTLY PREFERRED EMBODIMENT 

Processor Synthesis Flow

The retargeting of existing compilers and debuggers for the preferred embodiment of an
automatically configured target processor is particularly problematic, as it has no fixed
instruction set. The processor of the preferred embodiment does not support any kind of
assembly language. Existing compiler and debug tools are designed to be targeted at a fixed
architecture and require extensive modification to cope with a variable architecture. A fixed
intermediate representation is needed to hide the variability of the architecture from existing
software tools.

A particular tool trajectory has been chosen for the preferred embodiment to minimise the
amount of development effort and to promote adoption of the processor. The trajectory uses
existing, and therefore familiar, software development tools. The fixed intermediate
representation is in fact the machine code for an existing processor. Thus all the compiler
tools and debug tools for that particular architecture can be used. The code generator takes an
executable for the processor as an input and produces an executable for a customized
processor as the output.

The translation must be able to take code generated for a host architecture and produce code
for a particular processor. This code must faithfully reproduce the same results as the original
code. However, the focus of the preferred embodiment is to provide superior performance on
particularly key parts of an application. This superior performance is obtained through the
exploitation of higher levels of parallelism. Thus the clock frequency of the system could be
lowered to achieve the same level of performance and thus reduce power consumption. Even
though this application code is expressed in sequential machine code the code generator must
be able to reorder and schedule the individual operations as required to make effective use of
the innovative architectural features supported by the preferred embodiment. Thus individual
operations may be scheduled in a completely different order to that of the original sequential
code.

The preferred embodiment consists of a number of individual tools that will be used by
engineers. The overall design chain is subdivided into these tools to provide greater flexibility
and improve interoperability with systems supplied by other Electronic Design Automation
(EDA) software vendors.

The overall tool chain and relationships between the tools is shown in Figure 6. The box 601
represents the processor generation tool. This takes as input executable code for a host
processor 608 and various configuration files 609. Alternatively the tool 601 may provide a
graphical user interface allowing direct control of configuration parameters from within the
tool. The tool reads and writes descriptions 602 of candidate processor architectures. In one
possible flow through the tool a hardware description of a processor may be generated 612.
This enters a standard hardware synthesis and place and route flow 603. In addition to the top
level hardware description produced by the tool this may also incorporate library hardware
elements 604. The output of the flow 603 is hardware 614 that can be used to construct a
target system 605.

In another possible flow the tool 601 may generate microcode 613 for a processor that has
been previously generated. The architecture of the processor will be stored in an architecture
description 602.

Alternatively, the tool may be used to generate cycle accurate models of the hardware 615.
Advantageously, these may be generated before hardware generation 612 to allow accurate
simulation of a processor architecture before committing the design to the hardware flow 603.
The models may be generated as native code that may be run on a host machine. The code
may be compiled using the compiler 610 along with software models 607 of the hardware
blocks 604. The resulting software may be linked using a linker 611 to produce a simulation
606. This simulation may also be linked with other models to provide a complete system level
model if required.

Instruction Set Extension

The purpose of the software/hardware interface is to allow engineers to specify the boundary
between hardware and software in a system. Software languages do not normally have to
provide any facility for specifying that boundary. It is an intrinsic assumption that the
underlying processor hardware will support a set of basic operations (such as addition,
memory access etc) that makes implementation of the software possible. All software is
converted into a sequence of such operations by a compiler.

The preferred embodiment also has intrinsic hardware support for such operations. The
support covers the hardware units required for implementation of all the instructions for the
processor machine code used as input to the tools. However, the hardware units in the
processor may be extended as required to implement more specialised functions for particular
operations. Effectively, the processor of the preferred embodiment has a completely
extensible instruction set.

Any particular software function may be annotated to indicate that it is actually implemented
in hardware. Software functions are used as an abstraction for an operation actually
implemented in hardware. Such functions must not have any side effects, such as the
alteration of global variables or other areas of memory, since such operations cannot have a
direct implementation in a hardware unit. Each function takes a number of parameters and
produces one or more results based directly on those input parameters.

Calls to certain functions can be replaced with uses of a user specified block of hardware. The
engineer adds the function to a list of functions that are implemented in hardware. The
hardware function is given the same name as the equivalent software function. During
translation, whenever a call to the software function is encountered it is converted into an
invocation of the hardware unit. The parameters that would be passed to the software
function are passed as the operands to the hardware unit. The results that would have been
returned by the software function are obtained as the results from the hardware unit. In this
way the effective instruction set of the preferred embodiment processor can be extended as
required. The hardware functions can be accessed directly from a high level language just by
calling the appropriate function. Moreover, this can be achieved without having to modify or
extend the fixed instruction set of the host architecture.

Figure 2 shows an example usage of the intrinsic mechanism. The example provides a
hardware implementation of a bit counting function 201. This can be performed very
efficiently in hardware but is much more time consuming in a software implementation. The
bit count function is called in the code segment 202. If the bit counting function is marked as
being intrinsic then the hardware unit will be used when code 202 is targeted through the tool.

The example illustrates the power of the methodology. Within a few lines of code the user has
defined a custom instruction for a processor. There is no need to resort to assembly language
or any complex definition language. What is more, the program is completely standard
C/C++ and is easily readable by any programmer.
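
Figure 2 is not reproduced in this text. Purely as an illustration of the mechanism, the
following Python sketch pairs a behavioural bit counting function with a translation step that
replaces calls to listed functions by an invocation of a hardware unit of the same name. The
names HARDWARE_FUNCTIONS and translate_call_site, and the tuple encoding of operations, are
assumptions made for this sketch and do not come from the embodiment, which works on C/C++
source and host machine code.

```python
# Hypothetical list of functions the engineer has marked as implemented in hardware.
HARDWARE_FUNCTIONS = {"bit_count"}


def bit_count(value):
    """Behavioural model: count the set bits in a 32-bit word."""
    value &= 0xFFFFFFFF
    count = 0
    while value:
        value &= value - 1  # clear the lowest set bit
        count += 1
    return count


def translate_call_site(name, args):
    """Translate one call site into an illustrative target operation."""
    if name in HARDWARE_FUNCTIONS:
        # Parameters become operands of the hardware unit with the same name.
        return ("hw_unit", name, tuple(args))
    # Any other function remains an ordinary software call.
    return ("call", name, tuple(args))
```

Because the software function has no side effects, the same definition can serve as the
behavioural model during simulation while the call site is redirected to hardware.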

User defined hardware units do not have to be purely computational in nature. For instance,
functions can be written to read and write to an array. This corresponds to an additional
memory unit in the hardware of the processor. This is especially useful if extra memory units
need to be defined to improve overall memory bandwidth for certain applications.

The software form of a function that is implemented in a hardware unit forms a behavioural
model. That is, it describes the operation of the execution unit. The behavioural code is
expected to produce exactly the same results that the real hardware would. Such code is
executed during simulation. The code may access I/O or library functions that would not be
present on a target system. This allows the easy capture of trace information from execution
units. Particular execution units might represent I/O units in the real system. These units can
generate the appropriate stimulus required for the simulation.

The actual implementation of the hardware is generated separately from the behavioural
implementation. Any development methodology may be employed as long as the behavioural
model and hardware implementation remain equivalent. Normally, an implementation is
obtained by rewriting the software version into HDL. It can then be synthesized to generate
an actual hardware implementation. Each execution unit only implements a fine grain
component in the overall system so they are simple to verify.

Execution Unit Model

A single hardware execution unit may implement one or more individual functions. Each of
these functions is termed a method of the unit. This corresponds to the terminology of a class
encapsulation used in C++. Indeed, if C++ is used as the language to program the
architecture then classes may be directly used to model a hardware unit, with the function
members corresponding directly to these methods.

Figure 1 shows the basic model of an execution unit 103. The underlying model of an
execution unit is as a synchronous, pipelined unit. This fits with the computational model
of units within a processor. A unit is able to accept a number of operands 102 on a particular
clock cycle and will produce result(s) 104 a number of clock cycles later. This delay is referred
to as the latency of the unit and is illustrated as 106. It is expected that the unit is able to
accept a new operation on every clock cycle. If necessary a blockage can be set for the unit
that prevents it accepting another operation for a certain number of clock cycles after the last
one. Operand data 101 is supplied from other execution units in the architecture and result
data 105 are supplied to other units. The widths of operands and results are fully configurable.
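
The timing behaviour described above can be sketched as a small cycle-based model. This is a
minimal illustration only: the class name ExecutionUnit, its method names and the way blockage
is counted (extra cycles after the issuing cycle) are assumptions of this sketch, not details
taken from the embodiment.

```python
class ExecutionUnit:
    """Synchronous, pipelined unit with a fixed latency and optional blockage."""

    def __init__(self, function, latency, blockage=0):
        self.function = function    # behavioural model of the unit
        self.latency = latency      # cycles between operands in and result out
        self.blockage = blockage    # extra cycles the unit refuses new operations
        self.next_free_cycle = 0    # earliest cycle a new operation is accepted
        self.in_flight = []         # list of (ready_cycle, result)

    def issue(self, cycle, *operands):
        """Accept operands on `cycle`; return the cycle the result appears."""
        if cycle < self.next_free_cycle:
            raise RuntimeError("unit blocked until cycle %d" % self.next_free_cycle)
        ready = cycle + self.latency
        self.in_flight.append((ready, self.function(*operands)))
        self.next_free_cycle = cycle + 1 + self.blockage
        return ready

    def results(self, cycle):
        """Return the results that become available exactly on `cycle`."""
        return [value for ready, value in self.in_flight if ready == cycle]
```

With no blockage the unit accepts one operation per cycle, so two operations issued on
consecutive cycles produce results on consecutive cycles, each delayed by the latency.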

Code Translation

The translation must be able to take code generated for a particular host architecture and
produce code for a target processor. This code must faithfully reproduce the same results as
that host machine code. However, the focus of the preferred embodiment is to provide
superior performance on particularly key parts of an application.

In the translated code certain sequences of operations may be considered to be atomic. That
is, the execution of the target processor will never stop part way through such a sequence.
Therefore any intermediate processor state occurring during the execution of such an atomic
block cannot be visible externally to the target processor. Such a sequence is hereafter referred
to as an atomic block.

Register Representation 

A processor of the preferred embodiment has a central register file that is used to hold values
that are written to registers in the host code. There is largely a one-to-one correspondence
between these registers and those present in the host architecture.

Only those register values that are live at the conclusion of an atomic block need to be stored
into the corresponding register. A register is live if it might be subsequently read in the program.
Temporary uses of registers within an atomic block do not have to be reproduced in the
register file. Thus the amount of register file traffic can be significantly reduced in comparison
to the host architecture.

The target registers are directly equivalent to those present in the host architecture. The
majority of RISC host architectures have either 16 or 32 registers of 32 bit width. The same
number of registers are present in the target processor register file.

Typically a host processor will have a condition code register. This holds status bits generated
as a result of certain arithmetic or other operations, such as carry and overflow etc. These
status bits must also be preserved in a central register. Again, however, they only need to be
preserved if a register is live at the end of an atomic block.
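
As a hedged illustration of the write-back rule, the sketch below selects which values must
reach the central register file at the end of one atomic block. The function name, the
representation of a block as a list of (register, value) writes and the treatment of the
condition codes as a register named "cc" are all assumptions of this sketch; how liveness is
actually computed is not shown.

```python
def registers_to_write_back(block_writes, live_out):
    """Return the final values that must be stored to the register file.

    block_writes: list of (dest_register, value) writes in program order
                  within one atomic block.
    live_out:     set of registers that might be read after the block.
    """
    final_values = {}
    for dest, value in block_writes:
        final_values[dest] = value  # a later write supersedes an earlier one
    return {reg: val for reg, val in final_values.items() if reg in live_out}
```

Intermediate values of a register, and registers that are dead at the end of the block, never
generate register file traffic under this rule.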

Instruction Translation 

This section describes how individual instructions in the host architecture are translated into
sequences of operations for execution on the target architecture. The descriptions are based
on the mechanisms used for translating a typical RISC instruction set. The general techniques
for translating one instruction set to another are well known in the prior art.

Branches 

The branch itself is translated into two separate operations. Firstly there is an immediate load
that sets the destination address. The actual value is set when the final binary is being written
and the exact address has been determined. The second operation is the branch itself. The
immediate value is passed to the branch unit. This value is then passed to a branch control
unit.

Hardware Function Calls 

If the host code contains a call to a function marked as being implemented in hardware then
the call is translated into a use of the hardware unit. The software parameters are passed as the
operands of the hardware operation. There will be a direct correspondence between the
software function parameters and those that must be passed to the hardware unit.

The Application Binary Interface (ABI) of the host architecture will define how parameters
must be passed to a function. This information is used during the translation process so that
the locations of the parameters are known. In general the first few parameters are passed in
fixed registers and later parameters passed in fixed locations on the stack frame.

Code is generated to read each of the required parameters from the appropriate register. Later
parameters are read from stack frame locations as required. These loaded parameters are then
passed as operands to the hardware method.

If the software function provides a return result then this must be emulated from the
hardware call. A function call result is normally returned in a particular fixed register. Code is
generated to copy the result from the appropriate result port of the hardware unit to the
register.

Some parameters may be marked as output parameters corresponding to pointers (or
reference parameters) to hold results from the function. Code is generated to obtain the
parameter, representing the destination address, and generate a store of the result port to the
address. The wrapper code generated around the use of the hardware unit thus allows the
hardware unit to provide the same behaviour as a software function implementation.
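
The wrapper generation described above might be sketched as follows. The ABI used here (first
four arguments in r0 to r3, further arguments in consecutive words on the stack frame, result
in r0) is an assumption chosen for illustration, as are the operation names and the
convention that each output parameter consumes the next result port of the unit.

```python
ARG_REGISTERS = ["r0", "r1", "r2", "r3"]  # assumed ABI: first four arguments
RESULT_REGISTER = "r0"                    # assumed ABI: return value register
WORD_SIZE = 4                             # assumed size of a stack slot in bytes


def wrap_hardware_call(unit, num_params, has_result=True, output_params=()):
    """Emit illustrative operations around one use of a hardware unit.

    output_params lists the indices of parameters that are pointers
    receiving a result rather than operands of the unit.
    """
    ops, operands, out_addresses = [], [], []
    for i in range(num_params):
        temp = "t%d" % i
        if i < len(ARG_REGISTERS):
            ops.append(("read_reg", temp, ARG_REGISTERS[i]))
        else:
            offset = (i - len(ARG_REGISTERS)) * WORD_SIZE
            ops.append(("load_stack", temp, offset))
        if i in output_params:
            out_addresses.append(temp)  # destination address, not an operand
        else:
            operands.append(temp)
    ops.append(("invoke", unit, tuple(operands)))
    port = 0
    if has_result:
        # Copy the first result port into the fixed return register.
        ops.append(("write_reg", RESULT_REGISTER, (unit, "result", port)))
        port += 1
    for address in out_addresses:
        # Store each further result port through its pointer parameter.
        ops.append(("store", address, (unit, "result", port)))
        port += 1
    return ops
```

A single-parameter function such as a bit count then yields three operations: a register
read, the invocation of the unit and a write of the result to the return register.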

Software Function Calls

A software function call is similar to a branch operation except that a link register is set prior
to the call. The link register holds the return address from the call. In the host instruction the
link register may be implicitly set from the next PC value as part of the instruction operation.

In the translated version the link register is loaded with an immediate value representing the
address of the instruction following the call in the original program. This is the return location
and can be mapped via an address link in the translated image. The immediate value is written
to the link register prior to the actual call. The call is implemented as a load of the destination
address, followed by a branch operation.
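
A minimal sketch of this call translation is given below, including the late resolution of
the branch destination that is described for branches. The fixed 4-byte host instruction size,
the operation names, the "fixup" marker and the host-to-target dictionary are assumptions of
the sketch; the actual address link mechanism of the translated image is not modelled.

```python
def translate_software_call(host_call_address, host_target, instruction_size=4):
    """Translate a host call into link register setup plus a branch.

    The destination is left symbolic; it is resolved when the final
    binary is written and the exact target address is known.
    """
    return_address = host_call_address + instruction_size
    return [
        ("load_imm", "lr", return_address),            # host return location
        ("load_imm", "dest", ("fixup", host_target)),  # resolved at write time
        ("branch", "dest"),
    ]


def resolve_fixups(ops, host_to_target):
    """Replace symbolic destinations using a host-to-target address map."""
    resolved = []
    for op in ops:
        value = op[-1]
        if isinstance(value, tuple) and value[0] == "fixup":
            op = op[:-1] + (host_to_target[value[1]],)
        resolved.append(op)
    return resolved
```

Note that the link register receives the host address of the return location, so the state
seen at the callee matches what the host processor would have produced.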

Data Processing Instructions 

A particular host architecture will support a number of data processing operations. For a
RISC architecture these will typically use a 3-address format where a left and right operand are
specified along with a destination register. Some operations (such as compares and tests) do
not actually cause a write-back to a register. Addressing modes may be available to allow
immediate, register or shifted values to be specified, for instance. The instructions may
optionally write to the condition code register.

The individual instructions are translated into a number of separate operations on the target
architecture. The sequence of operations required is dependent upon any addressing mode
used. Code is first generated to load operands from the central register file. This is followed
by the translated data processing operation itself. The majority of instructions map to a single
data processing operation. If required then an operation is generated to write the result back
to the destination register. If the instruction updates the condition codes then further
operations are generated to update the affected condition code registers. Thus a sequence of
operations is generated that produces the same effect as the original host instruction.

Thus a single host instruction is translated into a number of individual operations. However,
in general the later code optimisation phase will be able to eliminate many of the register file
accesses to allow operands to be passed directly between the functional units.
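
The expansion of a data processing instruction can be illustrated with the sketch below. The
operation names, the temporaries "a", "b" and "result", and the restriction to register or
immediate right operands (shifted operands are omitted) are simplifying assumptions of this
sketch.

```python
def translate_data_op(opcode, dest, left, right, set_flags=False):
    """Expand a 3-address host instruction into basic target operations.

    right is either a register name (str) or an immediate value (int).
    dest is None for compares and tests, which write no register.
    """
    ops = [("read_reg", "a", left)]
    if isinstance(right, int):
        ops.append(("load_imm", "b", right))
    else:
        ops.append(("read_reg", "b", right))
    ops.append((opcode, "result", "a", "b"))
    if dest is not None:
        ops.append(("write_reg", dest, "result"))
    if set_flags:
        ops.append(("update_cc", "result"))
    return ops
```

For a plain register add this gives two register reads, the add and one register write; a
compare against an immediate gives a read, an immediate load, the subtraction and a condition
code update, with no register write.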

Any read of the PC register (if architecturally visible) is handled specially. Such an operation is
generally used to calculate the address of a data item in a position independent manner. The
full immediate value after addition is calculated and then a single operation is generated to load
it via an immediate unit.

Memory Access Instructions 

Typically memory access instructions may support a number of addressing modes. The code
sequence generated is dependent upon the address mode used for the host instruction. This
allows an address to be automatically incremented or decremented as part of the access
instruction without the requirement for additional address update instructions.

These addressing modes and updates must be subdivided into their constituent operations.
The memory access unit uses the final computed address as its operand. In the case of
pre-indexing the address is calculated and then written back to the base register if required. This
address is then used for the access. In the case of post-indexing the address is simply formed
from reading the base register. The access is performed and then the full address is calculated
and written back to the base register.
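
The two indexing cases can be sketched for a load as follows. The operation names, the
temporaries and the parameters post_index and writeback are assumptions of this sketch; only
an immediate offset is modelled.

```python
def translate_load(dest, base, offset, post_index=False, writeback=False):
    """Expand a host load with an immediate offset into basic operations.

    Pre-indexing (the default) computes base + offset before the access and
    optionally writes it back. Post-indexing accesses at the base address
    and then always updates the base register.
    """
    ops = [("read_reg", "base", base)]
    if post_index:
        ops.append(("load_mem", "data", "base"))
        ops.append(("add_imm", "addr", "base", offset))
        ops.append(("write_reg", base, "addr"))
    else:
        ops.append(("add_imm", "addr", "base", offset))
        if writeback:
            ops.append(("write_reg", base, "addr"))
        ops.append(("load_mem", "data", "addr"))
    ops.append(("write_reg", dest, "data"))
    return ops
```

In both cases the memory access unit receives one fully computed address, and the base
register update is an ordinary register write that can be scheduled independently.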

Block Memory Instructions

The block memory instructions allow multiple words to be loaded or stored to memory with a
single host instruction. The behaviour of such an instruction is unusual in that it does not
conform to the general principles of RISC instruction implementation. It takes a variable
number of clock cycles to execute depending upon the number of registers that need to be
stored or loaded. The multiple word access instructions are commonly used in function
prologues and epilogues to save and restore volatile registers on the stack frame.

Such block memory instructions are translated into multiple operations in the target
architecture. The base register is read and then for each individual access (as determined by
the register list in the host instruction) a memory operation is generated. An individual
addition to the base address, using an immediate offset, is generated for each access.
Individual offsets are generated rather than continually incrementing/decrementing a single
address value. This improves freedom to allow the memory accesses to be more easily issued
in parallel with other operations.
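
The use of individual offsets can be sketched for a multiple register store. The 4-byte word
size, the ascending address order and the operation names are assumptions of this sketch; a
real host instruction may also decrement addresses or update the base register.

```python
WORD_SIZE = 4  # assumed 32-bit registers


def translate_block_store(base, registers):
    """Expand a multi-register store into independent memory operations."""
    ops = [("read_reg", "base", base)]
    for index, reg in enumerate(registers):
        address = "addr%d" % index
        value = "val%d" % index
        # Each address is base + constant offset, so no access depends on
        # the address computed for a previous access.
        ops.append(("add_imm", address, "base", index * WORD_SIZE))
        ops.append(("read_reg", value, reg))
        ops.append(("store_mem", address, value))
    return ops
```

Because every address depends only on the single read of the base register, a scheduler is
free to issue the stores in any order or alongside unrelated operations.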

Translated Code Storage 

The static translation process occurs as a post-link operation. The intention is that this is called
automatically from the host software development environment. If the software IDE does
not support the calling of a post-link operation then a script can be used that incorporates
both the link and the call of the processor code generation tool.

Since the tool is run after the linker it operates on a complete executable. There are no unresolved
references and the locations for all data sections are determined. No support is provided for
any kind of dynamic linking, as such support is less important in embedded development
environments.

The executable image provided to the tool should not be stripped of the function symbols for
the functions that are being translated. If necessary all other symbols may be stripped from the
executable image in order to save space.

The translation takes the executable image and generates a new executable image that contains
the translated code. A new section is simply appended to the executable. From the perspective
of the host processor this is simply a static data section. It contains all of the translated code
for the target processor. Since exactly the same format is retained for the executable image, the
standard tools can be used to download both the host processor and the target processor
image to the system. Moreover, the image can be read as normal by debuggers in order to
support symbolic debug.

The appended Target Code Area (TCA) is a contiguous block of memory that holds the code
for the target processor. It also holds a mapping table that is used to transform host addresses
into target code addresses. This mapping mechanism is required for making debug of
generated processors compatible with existing host processors.

Target Code Area Base Address 

The TCA has a base address within the virtual address space of the host processor. This may
be explicitly set as a configuration parameter or, alternatively, an address may be selected that
follows on from the end of the existing program section.

The base address is stored within a data table within the executable so that the host processor
is able to store the base address into a target processor register named Target Code Base
(TCB). This allows host to target address mappings to be performed.

Target Code Area Size

The size of the TCA depends on the amount of target microcode that needs to be translated
for the processor. The TCA size is automatically scaled to a suitable size. The size of the TCA
influences the setting of the Target Code Mask (TCM). The TCM must be a mask that causes
host addresses to be mapped within the TCA. Thus the number of set bits within the TCM
represents a power of 2 size which is the one just smaller than the actual size of the table. The
reachable size of the TCA is made as large as possible to reduce the probability of address
collisions.
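
As a purely illustrative sketch, a mask of this kind might be derived from the TCA size as
shown below. The function name target_code_mask is hypothetical, the size is assumed to be
given in bytes, and "just smaller" is interpreted here as the largest power of 2 that does not
exceed the size. The two least significant bits are cleared because host addresses are word
aligned.

```python
def target_code_mask(tca_size_bytes):
    """Return a mask that keeps masked host addresses inside the TCA."""
    if tca_size_bytes < 4:
        raise ValueError("the TCA must hold at least one 32 bit word")
    # Largest power of two that does not exceed the TCA size (an assumption).
    reachable = 1 << (tca_size_bytes.bit_length() - 1)
    # All address bits below that power of two, minus the two alignment bits.
    return (reachable - 1) & ~0x3
```

For example, under these assumptions a TCA of 0x3000 bytes has a reachable size of 0x2000
bytes and a mask of 0x1FFC.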

Most of the words within the TCA are used to hold microcode for the processor. These
words are 32 bits in width even though the actual execution word size of the processor may
be wider. Individual execution words are subdivided into 32 bit words for storage within the
TCA. A type tag stored within each word allows microcode and other data types to be
interspersed.

Certain words within the TCA are used to hold address mappings. These are present to
support the transformation of host addresses into target addresses. Such transformations are
required in order to allow function returns and indirect function calls using host addresses.
When used for this purpose the mapping is referred to hereafter as an Address Link. The
mappings are also accessed by the debug unit when it needs to map a host breakpoint address
into an equivalent target code address. Such a mapping is referred to hereafter as a Debug
Link. The mappings must be placed at particular locations in the TCA, since they are part of a
hash table. Thus other data types are placed around the mappings. A type tag stored within
each word allows mapping and other data types to be interspersed.

Mapping Process 

Figure 3 illustrates the mapping process that is used to transform an input host address into an
address within the Target Code Area. This is used for accessing the Address Link and Debug
Link information from a host address.

Firstly, the host address 301 is masked with the Target Code Mask (TCM) 302 using the
hardware 303. This masks off the address so that it is within the size range of the target area.
The number of least significant bits that are set in the TCM will be dependent upon the target
area size. The lower 2 bits of the TCM are always reset, as all supplied host addresses must be
word aligned because all host instructions are word aligned.

The masked value is then added to the Target Code Base (TCB) 304 using the adder 306. This
is a fixed base value that gives the location of the Target Code Area in the virtual address
space of the host processor. It is set via a register within the Bus Interface Unit. After the
addition the address 305 will be within the range of the Target Code Area.
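
The mask and add steps of the mapping process can be expressed compactly. The following
Python sketch is illustrative only; the function name map_host_address and the example values
are assumptions, while the masking with the TCM and the addition of the TCB follow the
description above.

```python
def map_host_address(host_address, tcm, tcb):
    """Map a host instruction address to a location inside the Target Code Area."""
    if host_address & 0x3:
        raise ValueError("host instruction addresses must be word aligned")
    masked = host_address & tcm   # mask with the Target Code Mask
    return tcb + masked           # add the Target Code Base
```

With a hypothetical TCM of 0x1FFC and a TCB of 0x40000, the host address 0x12344 maps to
0x40344.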

Address Linking

The address linking mechanism allows host addresses to be used for indirect function calls
and function call returns. By using the host addresses the data stored by the target processor is
compatible with existing debuggers.

Function Entry Address Link

The function entry address link mechanism allows indirect calls to be made using the host
addresses of functions. Indirect function calls are explicitly supported in most high level
languages.

Figure 4 illustrates how the mechanism works. A translation must be made dynamically
between the host code address space 401 and the address of microcode within the target code
area 402. The host code performs an indirect function call 407 using a calculated function
address. The destination function is shown as 404. For instance, this may be as a result of a
virtual function call in C++ where the function pointer is obtained from the virtual function
table for the object in question. In general it is not possible to determine what set of functions
any given indirect call might reach. The code analysis must assume that any indirect call can
reach any function anywhere in the code image.

If the function has been translated to the target processor then it will have an address link 408
associated with it. This allows indirect function calls to be made between functions on the
target processor. The address link contains the address 405 of the translated form of the
function 406. Whenever there is an indirect function call in the translated code a special
address link operation is performed first. This performs a mapping 403 from the host
function address to the target address. An indirect call can then be made to the destination
pointed to by the link. Thus all indirect function calls are made doubly indirect in order to
reach the translated form of the function. If the link mapping does not access a suitable
address link entry then that indicates that an indirect function call is being made to a function
that has not been translated.
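
The doubly indirect call may be sketched as follows. This is an illustrative model only: the TCA
is represented as a Python dictionary keyed by address, and the names AddressLink and
resolve_indirect_call, together with the choice of storing the unmasked host address bits as
the comparison tag, are assumptions made for the example.

```python
from typing import NamedTuple, Optional


class AddressLink(NamedTuple):
    host_tag: int   # host address bits not consumed by the mask (assumed tag)
    target: int     # address of the translated function in the TCA


def resolve_indirect_call(host_address, tca, tcm, tcb) -> Optional[int]:
    """Return the target address for a host function address, or None."""
    slot = tcb + (host_address & tcm)          # mapping step
    entry = tca.get(slot)
    if isinstance(entry, AddressLink) and entry.host_tag == (host_address & ~tcm):
        return entry.target                    # translated form exists
    return None                                # function has not been translated
```

A result of None corresponds to the case in which the callee has not been translated and the
call must be directed to the host code instead.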

Return Address Link 

The return address link mechanism allows a host address, which would be used in the
original untranslated program, to be used in the translated version. The return address is
loaded into the link register by a call instruction in the host code image and this value is
architecturally visible. The link register is preserved on the stack frame if the callee function
makes any further calls. The debugger reads these preserved link values in order to generate a
stack trace back and show the location of the calling points represented on the stack. Thus to
maintain compatibility with debuggers the host link address must be used.

The return address link mechanism is illustrated in Figure 5. In the host code address space
501 a call 504 is made. This call will load a return address for the instruction following the call
505. That is the address to which execution returns after the call. In the translated code image
that return host address 505 has an address link entry 510 associated with it. The address link
points to the translated form of the instructions following the original call 509. The translated
version of the call 508 explicitly loads the link register with the address of the following host
instruction 505, in the same manner as the original code. In the callee function (not shown),
the return instruction (which is essentially an indirect branch to the link register) is converted
into an address link operation followed by an indirect branch. The map address link operation
uses the mapping 506 to obtain the content of the address link. The following
indirect branch then initiates execution at 509 after the translated call site.

This mechanism allows the host return addresses to be used and thus full compatibility
to be maintained with debuggers for the host architecture. The only cost is the requirement to
explicitly load the link register with an immediate address before a call and an extra map link at
the point of a function return.
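
The call and return sequence can be modelled in a few lines. In this illustrative Python sketch
the processor state is a dictionary with hypothetical "lr" and "pc" entries, and the address
links are simplified to a dictionary keyed directly by host address rather than a hashed table.

```python
def translated_call(state, host_return_address, callee_target):
    """Translated call: load the link register with the HOST return address."""
    state["lr"] = host_return_address   # same value the original code would hold
    state["pc"] = callee_target         # branch to the translated callee


def translated_return(state, address_links):
    """Translated return: map address link, then indirect branch."""
    target = address_links.get(state["lr"])   # map address link operation
    if target is None:
        raise LookupError("no address link for host return address")
    state["pc"] = target                       # continue after the translated call
```

Note that the link register holds a host address throughout, which is what a debugger for the
host architecture expects to find when it walks the stack.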

Debug Linking

Debug links are placed into the Target Code Area in order to support the debug of translated
code. There is at least one debug link for each atomic block in the target code. Thus the
number of debug links will generally be much greater than the number of address links in the
Target Code Area. They provide a mapping from a host address to a particular execution
word. That execution word represents the start address of an atomic block.

By providing debug links at atomic block granularity it is possible to provide breakpoints that
are only activated if a particular path through the code is taken. Each atomic block represents
a particular sequence of conditionally executed code. Only one debug link needs to be
provided for each atomic block since the breakpoint can occur at the start of the atomic block
and then code can be executed on the host processor to advance the execution point to the
exact breakpoint. This significantly reduces the number of links that are required in the Target
Code Area.

Link Collisions

Address and Debug Links are placed at locations in the Target Code Area that are determined
by the least significant bits of the host address. This is a simple hash table representation that
simply uses these bits as the hashing function. Given this address scheme it is possible that
multiple Address or Debug links may map to the same location in the Target Code Area.
Thus a mechanism is required to handle such collisions. The Target Code Area is made as
large as possible to reduce the number of collisions.

A link collision example is shown in Figure 7. The host code address space is shown 701 with
the requirement for two address links associated with the instructions 703. Both of these
instruction addresses map 704 to the same address link 710. These addresses map to the same
location in the Target Code Area because the host addresses share all the same least significant
bit values that are not masked by the TCM.

The collision is detected and a Collision Pointer 710 is placed in the Target Code Area 702.
The purpose of the collision pointer is to point to another area of memory in the Target Code
Area that holds all the Address or Debug links that mapped to the same initial location. The
upper bits of the Collision Pointer 709 hold a count of the total number of entries in the
indirect collision sequence. The Collision Pointer has an offset address 708 to the indirect
sequence of links 705 via the address 707. The indirect sequence itself consists of a number of
Address or Debug links. They are marked as Address or Debug links via their tags 706.
These have a special flag bit indicating that they are obtained indirectly via the Collision
Pointer. This avoids them being incorrectly used as Address or Debug links for the
locations to which they are allocated. All of the indirect Address or Debug links are
considered to be associated with the host address of the Collision Pointer. Note that the
indirect sequence of links may be interspersed with direct links. The two can be differentiated
by use of the flag bit.
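
A simplified model of the collision handling is sketched below in Python. It is illustrative
only and makes several assumptions not found in the description: the names Link,
CollisionPointer, build_table and lookup are hypothetical, each link stores the full host
address rather than a comparison tag, and the indirect sequences are kept in a separate list
instead of being interspersed with other words of the Target Code Area.

```python
from typing import NamedTuple


class Link(NamedTuple):
    host: int
    target: int
    indirect: bool = False   # flag bit: reachable only via a collision pointer


class CollisionPointer(NamedTuple):
    count: int    # number of entries in the indirect sequence
    offset: int   # where the indirect sequence starts


def build_table(links, tcm):
    """Place (host, target) pairs into a hashed table with collision pointers."""
    buckets = {}
    for host, target in links:
        buckets.setdefault(host & tcm, []).append((host, target))
    table, overflow = {}, []
    for slot, group in buckets.items():
        if len(group) == 1:
            table[slot] = Link(*group[0])
        else:
            table[slot] = CollisionPointer(len(group), len(overflow))
            overflow.extend(Link(h, t, True) for h, t in group)
    return table, overflow


def lookup(host, table, overflow, tcm):
    """Return the target for a host address, following a collision pointer if any."""
    entry = table.get(host & tcm)
    if isinstance(entry, CollisionPointer):
        for link in overflow[entry.offset:entry.offset + entry.count]:
            if link.indirect and link.host == host:
                return link.target
        return None
    if isinstance(entry, Link) and entry.host == host:
        return entry.target
    return None
```

The cost of a collision in this model is a short linear scan of the indirect sequence, which is
why the reachable size of the Target Code Area is made as large as possible.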

Target Data Format (TDF)

This section describes the different types of data that can be represented in the Target Code
Area. This is illustrated in Figure 8 showing the possible TDF types. Each of the data types is
32 bits in size and is distinguished using a 2 bit type tag 814.

Type 801 represents a word of microcode stored in 805. Type 802 represents an address link.
The bits 808 provide an offset in the target code area. The number of bits allocated to 808
depends on the size of the target code area. The bits 807 provide a tag comparison against a
number of the bits not used to index the location in the target code area. The bits 806 provide
various control attributes of the destination code. Type 803 represents a debug link. It has a
very similar format to an address link. Bits 811 provide the offset, bits 810 are for comparison
and bits 809 provide control attributes.
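
To make the field structure concrete, the following Python sketch packs and unpacks a link
word. Only the 32 bit word size and the 2 bit type tag come from the description. The field
order (tag in the least significant bits, then offset, comparison bits and control attributes),
the 4 bit control width and the function names are assumptions chosen for illustration; the
actual layout is defined by Figure 8.

```python
TAG_BITS = 2       # from the description: a 2 bit type tag
CONTROL_BITS = 4   # assumed width of the control attributes


def _compare_bits(offset_bits):
    # Whatever remains of the 32 bit word is used for the tag comparison.
    return 32 - TAG_BITS - CONTROL_BITS - offset_bits


def pack_link(tag, control, compare, offset, offset_bits):
    """Pack the fields of an address or debug link into one 32 bit word."""
    fields = ((tag, TAG_BITS), (offset, offset_bits),
              (compare, _compare_bits(offset_bits)), (control, CONTROL_BITS))
    word, shift = 0, 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError("field value does not fit in its width")
        word |= value << shift
        shift += width
    return word


def unpack_link(word, offset_bits):
    """Return (tag, control, compare, offset) from a packed link word."""
    widths = (TAG_BITS, offset_bits, _compare_bits(offset_bits), CONTROL_BITS)
    values = []
    for width in widths:
        values.append(word & ((1 << width) - 1))
        word >>= width
    tag, offset, compare, control = values
    return tag, control, compare, offset
```

The offset width is a parameter because the number of offset bits depends on the size of the
target code area; a larger area leaves fewer bits for the comparison field.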

Target Address Format (TAF)

The TAF is used as a common format for transferring destination addresses. The
representation allows both host and target addresses to be specified in a single format. This is
a requirement to allow host addresses to be specified when calls or branches are made to code
that has not been translated. Moreover, if a host to target address translation fails then this
format allows the host address to be retained. Thus an appropriate host continuation address
can be generated if such a branch is taken.

The format of the TAF is shown in Figure 12. It is designed to be a close subset of the TDF
to allow simple transformation of address links obtained in TDF.

Type 1201 represents a host instruction address stored in bits 1202. The lower two bits
contain the tag of 00. Thus instruction addresses must be word aligned. This is a property that
is generally true for 32 bit RISC architectures. Type 1203 represents a target address. The bits
1205 give the actual address of target code to execute and bits 1204 give the control
attributes.

Debug Environment

Before an application is ever run on real hardware it will have been tested in a simulation
environment. This allows full cycle and bit accurate testing. Stimulus and behavioural
modelling code will be produced to emulate the physical environment that the application will
be executed within. This process will allow the discovery of most major bugs in the
application. Since the simulation runs natively using a C++ environment, the engineer is able
to use his or her favourite debugger and integrated development environment.

Of course, there are always likely to be application level bugs that only manifest themselves in
the real hardware environment. To allow easy analysis of these, the preferred embodiment
supports a powerful debug environment.

The overall debug architecture is illustrated in Figure 10. A remote system 1006 communicates
with the target system via a serial or parallel link 1010. A serial link may be used since high
data speeds are not required and there is a need to minimise the area that the debug hardware
occupies. A remote debugging protocol is run over the link. The remote debugger can send
commands to the system to set breakpoints, read/write memory and read/write registers etc.
The remote debugger will be compatible with the instruction set of the host processor in the
system. The physical interface 1005 links to the blocks within the system. Typically the
physical interface will be compatible with JTAG.

The host processor in the system 1001 will contain a debug control unit 1003 connected to
the debug channel 1007. Typically the debug control unit contains status and breakpoint
registers. Breakpoint registers allow execution to be halted at a particular instruction address.
The host processor will connect to a number of coprocessors 1002 via a system bus or
coprocessor interface 1009. The coprocessors are running code that has been translated from
the same executable being run by the host processor. Each coprocessor will contain a debug
control unit 1004. These may snoop data 1008 from the same debug channel as the host
processor.

Breakpoint settings intended for the host processor can be detected by the debug control 
units 1004. These breakpoints will initially be specified as code addresses relating to the 
location of functions on the host processor 1001. The debug control units will use the address
linking and debug linking mechanisms to translate those into an address in the translated code
of a coprocessor. If the function is not mapped to the coprocessor then no mapping will be
located and thus no breakpoint will be set.

The coprocessor contains a number of breakpoint registers in the debug unit 1004. These are
set with the result of the address linking process. These cause the processor to halt if the target
code position of the breakpoint is reached. Execution is halted if a particular atomic block is
reached. This allows breakpoints that halt the machine on the equivalent of a particular host
instruction in the code.

If the program execution were to be stopped on a breakpoint on the boundary between 
atomic blocks then all the important register and memory state would be the same as that
observed on the host architecture. Of course, breakpoints can be set on any host instruction.
Reducing the size of an atomic block to a single host instruction would dramatically reduce
optimisation opportunities and thus the performance of the processor. 

Breakpointing on an individual host instruction is achieved as follows. A breakpoint is set by
specifying a host instruction on which to halt. This is converted, using the previously
described debug linking means, into the address of a particular atomic block in the translated
code. A breakpoint is set on the coprocessor at the start of that particular atomic block. This
atomic block will be the one immediately preceding the translation of the required host
instruction.

When the breakpoint is detected the execution is transferred back to the host processor. The
register and any modified memory state held on the coprocessor is sent back to the host
processor. The host processor will have had exactly the same breakpoint set. Execution on the
host processor will continue from the first instruction associated with the breakpointed atomic
block. Execution then continues instruction-by-instruction until the precise breakpointed
instruction is reached. In this manner the instruction level state at the breakpoint can be
reproduced from a combination of state generated by the coprocessor and the host processor
itself.
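
The two stage breakpoint procedure may be sketched as follows. The Python below is an
illustrative model only: the function name run_to_breakpoint is hypothetical, the atomic block
boundaries are assumed to be available as a sorted list of host start addresses, and the step
argument stands in for single stepping one instruction on the host processor.

```python
import bisect


def run_to_breakpoint(host_breakpoint, block_starts, step):
    """Return (atomic block start, number of host single steps performed).

    block_starts: sorted host addresses at which atomic blocks begin.
    step: callable taking the current host address and returning the next one.
    """
    index = bisect.bisect_right(block_starts, host_breakpoint) - 1
    if index < 0:
        raise LookupError("no atomic block precedes the breakpoint address")
    block_start = block_starts[index]   # where the coprocessor is halted
    pc, steps = block_start, 0
    while pc != host_breakpoint:        # host advances to the exact instruction
        pc = step(pc)
        steps += 1
    return block_start, steps
```

For straight-line code with 4 byte instructions, step can simply be lambda pc: pc + 4. A
breakpoint that coincides with the start of an atomic block requires no host single steps.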

To allow high levels of parallelism in the architecture, code can be scheduled out-of-order
with respect to the original sequential code. Results may be generated in a completely different
order to the way they are expressed in the original sequential code. The user should not need
to be aware of this. When they are debugging the code and single stepping through it they
expect expressions to be evaluated and results produced in the sequential order expressed in
the original sequential code.

Executable Update 

In the preferred embodiment specialised coprocessors may be generated automatically.
These should interact with the host processor in the system in as seamless a manner as
possible. The software application should be able to run across the combination of both the
host processor and the coprocessors in the system. Certain software functions are marked for
execution on a coprocessor. Whenever the function is called on the host processor the
execution flow should be automatically directed to the coprocessor.

To this end the original host code executable is modified automatically in the preferred
embodiment. The initial instructions in the host code for functions that are being mapped to a
coprocessor are modified to load the address of the function on the coprocessor and branch
to a common handling function. This handling function is responsible for communicating
with the coprocessor. Certain aspects of the host processor state (such as the registers) may be
transferred across to the coprocessor. The coprocessor execution is then initiated from the
required address to execute the translated function. When the function execution is completed
the state is transferred back to the host processor. Execution may then continue on the host
processor. Advantageously, this provides the effect of a transparent offload of the function
onto a coprocessor.
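
The executable update can be pictured with a small symbolic model. In the Python sketch below
the executable image is a dictionary from host address to a symbolic instruction tuple; the
instruction names, the scratch register, the 4 byte instruction size and the function names
patch_function_entry and offload_handler are all hypothetical and serve only to illustrate the
sequence of steps.

```python
def patch_function_entry(image, function_address, coprocessor_address, handler_address):
    """Return a copy of the image whose function entry redirects to the handler."""
    patched = dict(image)
    # Load the address of the translated function on the coprocessor...
    patched[function_address] = ("load_imm", "scratch", coprocessor_address)
    # ...and branch to the common handling function.
    patched[function_address + 4] = ("branch", handler_address)
    return patched


def offload_handler(host_registers, coprocessor_address, run_on_coprocessor):
    """Common handler: transfer state, run the translated function, transfer back."""
    coprocessor_registers = dict(host_registers)            # state sent across
    result = run_on_coprocessor(coprocessor_address, coprocessor_registers)
    host_registers.update(result)                           # state sent back
    return host_registers
```

In this model run_on_coprocessor is a stand-in for initiating execution on the coprocessor and
waiting for the translated function to complete.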

System Architecture

This section describes the options for the system architecture of the preferred embodiment. It
is desirable to provide a shared memory environment where the coprocessors can access the
same address space as the host processor. This allows pointers to be freely passed between the
two environments and allows complex data structures to be shared.

Providing a shared memory environment adds hardware complexity, as caches are required 
within the coprocessor that must remain coherent with the contents of other caches in the system.

There are two possible interaction models for the host processors and the coprocessors as
detailed below:

Blocking Model

In the simplest configuration the host processor is blocked while the coprocessor is executing
functions. An illustration of this architecture is given in Figure 11. A host processor 1101
contains a cache 1108 and also an interface to the system bus 1107. The main memory 1103
will be connected to the processor using the system bus 1110. Optionally, the host processor
may have a specialised coprocessor interface 1109. A coprocessor 1102 may be connected to
the host processor either via the system bus and a bus interface unit 1105 or via an optional
interface to a coprocessor port 1106.

It is expected that the host processor contains a cache 1104. For good performance such
caches normally use a write-back rather than a write-through caching mechanism.
Thus data that has been updated is only written back to main memory when the cache line
needs to be evicted.

The coprocessor is implemented as a slave to the host processor. Each coprocessor is
allocated a block of registers in the address map of the bus. These registers can be accessed by
software running on the processor. Transmission of data from the coprocessor to the host
processor is performed via the host processor reading registers stored within the interface.
The coprocessors may also have the capability to generate an interrupt to the host processor
in order to handle a &i^n \ ^ event or something outside of the normal communication
protocol.

In this model all memory accesses are directed via the host processor. This allows all addresses
handled by the coprocessor to be virtual. Thus the cache 1104 is indexed using virtual
addresses. Memory addresses supplied to the host processor are automatically translated into
physical addresses using the address translation mechanisms already implemented by the host
processor.

When the host processor timeline 1112 encounters a function that is being executed by the
coprocessor the register state is passed to the coprocessor 1114. The coprocessor timeline 1113
shows the execution of the functions. As soon as the initiation is received from the host processor
the coprocessor leaves its sleep state 1119. While the coprocessor is running the host
processor is blocked 1116 waiting for requests from the coprocessor. During its execution the
coprocessor will initiate requests 1117 if there are cache misses. These requests will be handled
by the host processor and those which cannot be satisfied from data in the coprocessor will
result in transactions to the main memory 1118. The main memory timeline 1111 shows the
memory being idle 1120 unless it receives a transaction request.

When the end of the function executed is reached any dirty data in the cache 1104 is written 
back 1115. The coprocessor can then re-enter its sleeping state 1119. 

Non-Blocking Implementation

The non-blocking model provides a more complex interaction between the host processor
and the coprocessor. In this model the host processor may continue and perform other tasks
while the coprocessor is operational. This relies on the coprocessor being able to become a
bus master and initiate memory accesses directly. Since the coprocessor must be able to
initiate memory accesses using physical addresses it needs to be able to perform a virtual to
physical address translation.

The model relies on the use of threads in the application program running on the host
processor. When a particular thread encounters a function that has been mapped onto a
particular coprocessor the thread effectively transfers onto the coprocessor. The host
processor is therefore freed to continue running other threads.

An example configuration is shown in Figure 9. A host processor 901 contains a bus interface
unit 904 that interfaces to the system bus 908. The main memory 903 is connected to the
system bus. The host processor also contains a cache 905 and a Translation Lookaside Buffer
(TLB) 919. This contains a cache of translations between virtual addresses and physical
addresses in the memory system. A coprocessor 902 is connected to the host processor via a
bus interface unit 906. The coprocessor also contains a write back cache 904.

The memory address translations must be coherent 920 between the host processor and a
TLB held by the coprocessor in the bus interface unit 906. This TLB is used for mapping
virtual to physical addresses before initiating a memory transaction with the main memory
903.

A shared virtual memory system also requires management of the entries within the TLB. In
this configuration it is assumed that the host processor is running an operating system that
determines when to page in and page out particular blocks of virtual memory in the physical
address space. Whenever there is a miss in the TLB of a coprocessor an interrupt to the host
processor may be generated. This causes the required virtual page to be looked up in the page
tables (bringing the data in from secondary storage if required) and the physical page address
transmitted to the coprocessor where it is stored for future usage in the TLB. Moreover, if a
physical page is ever reclaimed by the operating system for use by another virtual page then
the corresponding entries in all coprocessor TLBs must be evicted. This is done using a
broadcast message sent from the host processor. Thus this mechanism requires changes to be
made to the memory management handling routines within the kernel of the operating
system.
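
The TLB management described above may be modelled as follows. This Python sketch is
illustrative only: the class name CoprocessorTLB, the function broadcast_reclaim, the 4 KiB page
size and the use of a callable to stand in for the interrupt to the host processor are
assumptions made for the example.

```python
PAGE_SHIFT = 12  # assumed 4 KiB pages


class CoprocessorTLB:
    def __init__(self, host_lookup):
        self.entries = {}                # virtual page -> physical page
        self.host_lookup = host_lookup   # stands in for the interrupt to the host
        self.misses = 0

    def translate(self, virtual_address):
        page = virtual_address >> PAGE_SHIFT
        if page not in self.entries:
            self.misses += 1
            # Miss: the host looks the page up in its page tables and the
            # physical page address is stored for future use.
            self.entries[page] = self.host_lookup(page)
        offset = virtual_address & ((1 << PAGE_SHIFT) - 1)
        return (self.entries[page] << PAGE_SHIFT) | offset

    def evict_physical(self, physical_page):
        self.entries = {v: p for v, p in self.entries.items() if p != physical_page}


def broadcast_reclaim(tlbs, physical_page):
    """Host reclaims a physical page: every coprocessor TLB drops its entries."""
    for tlb in tlbs:
        tlb.evict_physical(physical_page)
```

After a reclaim, the next access to the affected virtual page misses again and picks up
whatever mapping the operating system has since installed.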

The host processor timeline 910 is shown executing a first thread 914. If this thread
encounters a function that should be executed on the coprocessor then any dirty data in the
host processor cache is first written back to main memory 913. The coprocessor is then
initiated by transferring register state across to it 917. The coprocessor timeline 911 is diverted
at that point to start execution 912 of the functions. The host processor initiates another
thread 915 that may execute while the first thread is being executed on the coprocessor. Cache
misses 918 in the coprocessor initiate direct transactions with the main memory 909. Issues of
coherence between the host processor and coprocessor are dealt with by the standard thread
synchronisation requirements for shared memory. When the execution of the coprocessor
functions is complete any dirty data can be written back to main memory 916 and the host
processor is able to proceed with the original thread 914.



It is understood that there are many possible alternative embodiments of the invention. It is
recognized that the description contained herein is only one possible embodiment. This
should not be taken as a limitation of the scope of the invention. The scope should be defined
by the claims and we therefore assert as our invention all that comes within the scope and
spirit of those claims.

CLAIMS

1. A method of automatically configuring a microprocessor architecture, comprising the
step of using executable code for another type of microprocessor.

2. The method according to claim 1 whereby the configured processor is used as a
coprocessor in a system.

3. The method according to claim 2 whereby the host processor in the system is of the
type of the executable used to configure the coprocessor.

4. The method according to claim 3 whereby the host processor in a system runs the
executable used to configure the coprocessor.

5. The method according to claim 4 whereby a number of individual software functions
in the executable code are marked for translation and execution on the coprocessor.

6. The method according to claim 5 whereby the original executable image is
automatically modified so that function calls to those translated functions cause an
equivalent function to be executed on the coprocessor.

7. The method according to claim 6 whereby the coprocessor initiation involves the
transfer of register state from the host processor to the coprocessor.

8. The method according to claim 7 whereby the completion of a function on the
coprocessor causes the transfer of register state from the coprocessor to the host
processor.

9. The method according to claim 7 whereby the completion of a function on the
coprocessor causes the transfer of memory state from the coprocessor to the host
processor.

10. The method according to claim 1 whereby the architecture generated is designed to
execute parts of the executable with higher performance than can be achieved with
the host processor.

11. The method according to claim 10 whereby the improved performance is obtained by
the execution of more operations in parallel than is achieved with the host processor.

12. The method according to claim 1 whereby the architecture generated is designed to
execute parts of the executable with lower power consumption than can be achieved
with the host processor.

13. The method according to claim 1 whereby the executable code is translated into the
instruction set of the configured processor.

14. The method according to claim 13 whereby each instruction in the executable image
is translated into one or more basic operations.

15. The method according to claim 14 whereby each of these operations may be
performed by a particular execution unit that is present in the configured processor.

16. The method according to claim 15 whereby the register file is present as an execution
unit in the architecture and explicit operations to read and write the register file are
generated as part of the translation.

17. The method according to claim 16 whereby static register analysis may be used to
eliminate unnecessary writes of registers.

18. The method according to daim 17 \rfiereby code is subdivided into atomically 
executed blocks. 

19. Hie mediod according to daim 18 whereby each atomically executed blodt 
reproduces die operations of die corresponding host code. 

20. The method according to claim 19 whereby die state of live raters at die end of the 
atomic block execution is identical to diat obtained fix>m execution on the host 
processor. 

21. The method accotding to claim 19 whereby the state of the memory at the end of the 
atomic block execution is identical to that obtained firom execution on the host 
processor. 
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22. The method according to claim 1 bteakpoints may be set on the configured 
architecture. 

23. The method according to claim 2 whereby the breakpoints may be specified using the 
addresses of instructions in the original executable. 

24. The method according to claim 3 whereby the nearest preceding instruction for which 
state can be synchronised on the configured processor is determined when the 
breakpoint is set. 

25. The method according to claim 24 whereby the configured processor contains a 
mechanism to determine the equivalent target instruction address for a host 
instruction address. 

26. The method according to claim 25 whereby the configured processor contains 
hardware to cause a breakpoint halt on the required address that prevents any side 
effects caused by sequentially later instructions. 

27. The method according to claim 26 whereby upon detection of a breakpoint execution 
can be continued on the host processor from the synchronised address until the point 
of the actual breakpoint. 

28. The method according to claim 23 whereby the breakpoint address is determined by 
decoding the data stream on the debug interface to the host processor. 

29. The method according to claim 1 whereby certain host processor instruction 
addresses may be converted to target processor addresses while the system is running. 

30. The method according to claim 29 whereby a hashing table is maintained in memory 
to perform a mapping of certain host processor instruction addresses to target 
addresses. 

31. The method according to claim 30 whereby the mapping information may be 
interleaved with the machine code for the target processor. 

32. The method according to claim 31 whereby the state [...] of a table are used to 
indicate whether the information represents an address mapping entry or target 
machine code. 

33. The method according to claim 1 whereby function calls in the executable code may 
be replaced with uses of particular hardware blocks in the configured processor. 

34. The method according to claim 33 whereby the input parameters to the software 
function correspond to the operands supplied to the corresponding hardware unit. 

35. The method according to claim 34 whereby the reference parameters and return result 
from a software function correspond to the results generated by the corresponding 
hardware unit. 

36. The method according to claim 33 whereby the original software implementation may 
be used as a behavioural model for hardware for the purposes of simulation. 

37. The method according to claim 1 whereby an instruction set translator converts 
instructions from the executable image into behaviourally equivalent operations that 
are mapped to the target processor. 

38. The method according to claim 4 whereby the coprocessor contains one or more 
cache memories. 

39. The method according to claim 38 whereby the coprocessor and host processor 
communicate via a system bus or a generic coprocessor interface on the host 
processor. 

40. The method according to claim 39 whereby the host processor services memory 
access requests from the coprocessor while the coprocessor is operational. 

41. The method according to claim 40 whereby the coprocessor is able to flush its caches 
of all modified data when the end of function execution is reached. 

42. The method according to claim 39 whereby a copy of some virtual to physical page 
mappings is maintained by the coprocessor. 

43. A microprocessor that has been automatically configured using the method as defined 
in any of the preceding claims 1 to 42. 
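
By way of illustration only, the host-to-target address mapping recited in claims 29 and 30 could be sketched in C as a small open-addressing hash table. This is a minimal sketch under stated assumptions, not the implementation described in the specification: the names addr_map, map_init, map_insert and map_lookup, the table size, the multiplicative hash, and the sentinel used for empty slots are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table size (must be a power of two) and empty-slot marker.
   EMPTY_KEY is assumed never to be a valid host instruction address. */
#define MAP_SLOTS 64u
#define EMPTY_KEY 0xFFFFFFFFu

/* One mapping: a host instruction address and its target equivalent. */
typedef struct {
    uint32_t host_addr;
    uint32_t target_addr;
} map_entry;

typedef struct {
    map_entry slots[MAP_SLOTS];
} addr_map;

/* Multiplicative hash: keep the top 6 bits to index 64 slots. */
static uint32_t hash_addr(uint32_t host)
{
    return (uint32_t)(host * 2654435761u) >> 26;
}

/* Mark every slot as empty. */
void map_init(addr_map *m)
{
    for (size_t i = 0; i < MAP_SLOTS; i++) {
        m->slots[i].host_addr = EMPTY_KEY;
        m->slots[i].target_addr = 0;
    }
}

/* Insert or update a mapping using linear probing.
   Returns 1 on success, 0 if the table is full. */
int map_insert(addr_map *m, uint32_t host, uint32_t target)
{
    uint32_t start = hash_addr(host);
    for (uint32_t n = 0; n < MAP_SLOTS; n++) {
        map_entry *e = &m->slots[(start + n) & (MAP_SLOTS - 1u)];
        if (e->host_addr == EMPTY_KEY || e->host_addr == host) {
            e->host_addr = host;
            e->target_addr = target;
            return 1;
        }
    }
    return 0;
}

/* Look up the target address for a host address.
   Returns 1 and stores the result in *target if a mapping exists, else 0. */
int map_lookup(const addr_map *m, uint32_t host, uint32_t *target)
{
    uint32_t start = hash_addr(host);
    for (uint32_t n = 0; n < MAP_SLOTS; n++) {
        const map_entry *e = &m->slots[(start + n) & (MAP_SLOTS - 1u)];
        if (e->host_addr == EMPTY_KEY)
            return 0;               /* probe chain ended: no mapping */
        if (e->host_addr == host) {
            *target = e->target_addr;
            return 1;
        }
    }
    return 0;                       /* table full and key not present */
}
```

In such a sketch, only those host addresses that can be reached indirectly at run time (for example, function entry points) would need entries; a failed lookup would indicate that no translated code exists for that host address.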



Diagrams 

Figure 1 

[Figure 1: drawing not reproduced; reference numerals 102 and 103] 

Figure 2 



unsigned count_bits (unsigned x) { 
    int i, count; 

    count = 0; 

    for (i = 0; i < 32; i++) { 
        if (x & (1 << i)) count++; 
    } 

    return count; 
}                                        201 

function () { 

    count = count_bits (sample); 
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